pic
Personal
Website

Performance Improvements in Julia (Part 1): Lazy Broadcasting

Martin Alfaro - PhD in Economics
June 12, 2023
Remark
The script in this post is available here under the name allCode.jl. It's been tested under Julia 1.9.

For more accurate benchmarks, each variable is wrapped in ref(x) = (Ref(x))[] and interpolated through $.

Introduction

Readability and performance are two crucial features in coding. However, they're not always compatible, leading you to prioritize one over the other. In this post, we'll explore a tool that enables you to enhance performance without compromising readability: lazy broadcasting.

In programming, the term lazy refers to postponing the execution of some code until the result is required. The opposite is eager, which means executing the code as soon as the computer reads it. When it comes to broadcasting, Julia performs it eagerly by default. This post will show how lazy broadcasting can enhance performance.

Lazy broadcasting is useful for computing an output that requires several intermediate steps, as long as intermediate results are unimportant and we don't need to keep them. This improves performance by i) avoiding memory allocations for intermediate results, [note] Strictly speaking, we're referring to heap allocations. The concept involves several technical aspects that are not necessary for our purposes. See here if you want to know why this matter for performance, and here for an intuitive explanation of heap and stack allocations. and ii) combining operations into a single one, which enables Julia to perform further optimizations.

Motivating Example

To facilitate the explanation, let's consider a concrete example. This is based on Economics, but don't worry if you're not an economist, as it doesn't require any specialized knowledge.

The goal is to compute a company's profit for a given price of its product. Profit is defined as the difference between revenue and cost. In turn, revenue is the product of price and quantity, whereas cost is the cost per unit times the quantity produced. Therefore, if our aim is to compute profits for some given prices, we need to compute quantities, revenues, and costs as intermediate steps.

Formally, given some price, we need to compute the quantity demanded: q(p):=10βˆ’2.1Γ—p q\left(p\right):=10-2.1\times p

which makes it possible to compute the revenue and costs: r(q,p):=qΓ—p r\left(q,p\right):=q\times p c(q):=1.1Γ—q c\left(q\right):=1.1\times q

providing us with the profit Ο€(r,c):=rβˆ’c \pi\left(r,c\right):=r-c

Our task at hand is to compute (4), with (1), (2), and (3) as intermediate results we won't eventually need.

Comparing Approaches

The following packages are required for the code in this post. Note in particular the use of LazyArrays, which is what allows us to use lazy broadcasting.

Packages for the Post
using DataFrames, Distributions, Random, BenchmarkTools
using LazyArrays

The code below generates mock data for prices. It additionally creates functions in Julia representing equations (1), (2), (3), and (4).

Mock Data and Functions
Random.seed!(123)        # setting the seed for reproducibility
price_vector             = 2 .* rand(1_000)

quantity(price)          = 10 - 2.1 * price
revenue(quantity, price) = quantity * price
cost(quantity)           = 1.1 * quantity
profit(revenue, cost)    = revenue - cost

We now compare the performance of two broadcasting approaches. The first one is the default approach, which performs broadcasting eagerly. The second one performs the broadcast lazily, which we indicate by adding the macro @~. [note] The benchmarks in this post wrap each variable into the function ref(x) = (Ref(x))[]. This prevents the computer from realizing that we're repeating the same calculation. Otherwise, @btime would directly repeat stating the result after a few iterations rather than computing it, thus distorting the comparison.

function compute_profit1(price)
    compute_quantity = quantity.(price)
    compute_revenue  = revenue.(compute_quantity, price)
    compute_cost     = cost.(compute_quantity)
    
    profit.(compute_revenue, compute_cost)     
end

@btime compute_profit1(ref($price_vector));
Output in REPL
  1.070 ΞΌs (4 allocations: 31.75 KiB)
function compute_profit2(price)
    compute_quantity = @~ quantity.(price)
    compute_revenue  = @~ revenue.(compute_quantity, price)
    compute_cost     = @~ cost.(compute_quantity)
    
    profit.(compute_revenue, compute_cost)
end

@btime compute_profit2(ref($price_vector));
Output in REPL
  233.333 ns (1 allocation: 7.94 KiB)

As 1 ΞΌs is 1,000 ns, the lazy approach is faster. This occurs because the default broadcasting performs eager computations. Thus, compute_profit1 stores the results for quantities, revenue, and cost, and then computes the profit with those results. The procedure ends up performing 4 allocations: three for the intermediate steps, and one for the output of interest (i.e., profit).

In contrast, compute_profit2 indicates Julia that quantities, revenue, and cost should be calculated lazily. It's like saying to Julia: "do not compute anything at this point, let me only describe the operation involved". This allows Julia to avoid the allocations for the intermediate results, fuse all the operations into a single one, and provide the profit as the outcome. This explains why we've reduced 4 allocations to 1, where the only allocation is storing the vector of profits and hence is unavoidable.

Remark
Notice that the allocations would've occurred even if we hadn't stored the results in variables. This is because any expression that relies on broadcasting allocates, even if the final result is a scalar (which does not allocate). The following example demonstrates this.

x        = [2,3,4]
eager(x) = sum(x .* x .* log.(x))

@btime eager(ref($x))
Output in REPL
  30.955 ns (1 allocation: 80 bytes)
x       = [2,3,4]
lazy(x) = sum(@~(x .* x .* log.(x)))

@btime lazy(ref($x))
Output in REPL
  29.819 ns (0 allocations: 0 bytes)

When to Use Lazy Broadcasts?

A lazy approach is only preferred when the intermediate results aren't needed for any other computation. If, for instance, quantities are necessary to computer other results besides profits, storing them will likely improve performanceβ€”the detrimental effect of storing the result would be more than compensated for by avoiding the computation of quantities multiple times.

Furthermore, lazy broadcasting should only be applied to performance-critical parts of the code, or when a function is repeatedly used across your projects. Otherwise, the time spent on writing the code might not be worth the performance gain.

Lazy broadcasting could also be useful if you have a tight loop or, in the same vein, for computations on subgroups in a DataFrame. To illustrate this, let's consider a hypothetical dataset containing information on companies, spanning several years and countries. In particular, let's create some mock data for years 2000-2020 and 50 countries, each with 10,000 companies. This results in a DataFrame with 10,500,000 observations. If you're interested in seeing how the data was constructed, you can open the code below.

Mock Data
#80-20 Pareto Rule within country-year
prices_random(nr_firms, nr_countries) = [rand(Pareto(log(4, 5)), nr_firms) for _ in 1:nr_countries] 

function generate_yearly_data(year, nr_firms, nr_countries)
    revenue = prices_random(nr_firms, nr_countries)
    dff     = DataFrame(country=1:nr_countries, price = revenue)
    dff     = flatten(dff,2)

    dff.firm  = repeat(1:nr_firms, outer = nr_countries)
    dff.year  .= year
    return dff
end


function generate_data(nr_years, nr_firms, nr_countries)    
    dff = generate_yearly_data(2000, nr_firms, nr_countries)
    for year in 2001:nr_years-1+2001
        append!(dff, generate_yearly_data(year, nr_firms, nr_countries)) 
    end   

    return select!(dff, [:year, :country, :firm, :price])
end

dff = generate_data(nr_years, nr_firms, nr_countries) #it generates the DataFrame

Ultimately, what matters is that the DataFrame looks like the following.

Output in REPL
julia> dff[1:5,:]
5Γ—4 DataFrame
 Row β”‚ year   country  firm   price
     β”‚ Int64  Int64    Int64  Float64
─────┼────────────────────────────────
   1 β”‚  2000        1      1  2.75317
   2 β”‚  2000        1      2  4.54952
   3 β”‚  2000        1      3  2.38347
   4 β”‚  2000        1      4  1.35309
   5 β”‚  2000        1      5  1.72773

Suppose we want to compute the mean profit for each country and year. This requires performing the same calculation nr_countriesΓ—nr_years=1,050nr\_countries\times nr\_years=1,050 times. In contrast to the previous example, each firm's profit is not of interest anymore, and so we can avoid storing this result too. The following comparison shows an eager and a lazy broadcast approach for computing the mean profit.

dfg = groupby(dff,[:country,:year])

function average_profit1(price)
    compute_quantity = quantity.(price)
    compute_revenue  = revenue.(compute_quantity, price)
    compute_cost     = cost.(compute_quantity)    
    compute_profit   = profit.(compute_revenue, compute_cost)

    mean(compute_profit)
end

@btime transform(ref(dfg), ref(:price) => average_profit1 => :profit)
Output in REPL
  233.661 ms (9203 allocations: 1.10 GiB)
dfg = groupby(dff,[:country,:year])

function average_profit2(price)
    compute_quantity = @~ quantity.(price)
    compute_revenue  = @~ revenue.(compute_quantity, price)
    compute_cost     = @~ cost.(compute_quantity)
    compute_profit   = @~ profit.(compute_revenue, compute_cost)

    mean(compute_profit)
end

@btime transform(ref(dfg), ref(:price) => average_profit2 => :profit)
Output in REPL
  164.001 ms (794 allocations: 801.15 MiB)